{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating Opik's Hallucination Metric\n",
    "\n",
    "For this guide we will be evaluating the Hallucination metric included in the LLM Evaluation SDK which will showcase both how to use the `evaluation` functionality in the platform as well as the quality of the Hallucination metric included in the SDK."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating an account on Comet.com\n",
    "\n",
    "[Comet](https://www.comet.com/site/?from=llm&utm_source=opik&utm_medium=colab&utm_content=eval_hall&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=eval_hall&utm_campaign=opik) and grab your API Key.\n",
    "\n",
    "> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik&utm_medium=colab&utm_content=eval_hall&utm_campaign=opik) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install opik pyarrow pandas fsspec huggingface_hub --upgrade --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import opik\n",
    "\n",
    "opik.configure(use_local=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing our environment\n",
    "\n",
    "First, we will install configure the OpenAI API key and create a new Opik dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using the [HaluEval dataset](https://huggingface.co/datasets/pminervini/HaluEval?library=pandas) which according to this [paper](https://arxiv.org/pdf/2305.11747) ChatGPT detects 86.2% of hallucinations. The first step will be to create a dataset in the platform so we can keep track of the results of the evaluation.\n",
    "\n",
    "Since the insert methods in the SDK deduplicates items, we can insert 50 items and if the items already exist, Opik will automatically remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dataset\n",
    "import opik\n",
    "import pandas as pd\n",
    "\n",
    "client = opik.Opik()\n",
    "\n",
    "# Create dataset\n",
    "dataset = client.get_or_create_dataset(name=\"HaluEval\", description=\"HaluEval dataset\")\n",
    "\n",
    "# Insert items into dataset\n",
    "df = pd.read_parquet(\n",
    "    \"hf://datasets/pminervini/HaluEval/general/data-00000-of-00001.parquet\"\n",
    ")\n",
    "df = df.sample(n=50, random_state=42)\n",
    "\n",
    "dataset_records = [\n",
    "    {\n",
    "        \"input\": x[\"user_query\"],\n",
    "        \"llm_output\": x[\"chatgpt_response\"],\n",
    "        \"expected_hallucination_label\": x[\"hallucination\"],\n",
    "    }\n",
    "    for x in df.to_dict(orient=\"records\")\n",
    "]\n",
    "\n",
    "dataset.insert(dataset_records)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the hallucination metric\n",
    "\n",
    "In order to evaluate the performance of the Opik hallucination metric, we will define:\n",
    "\n",
    "- Evaluation task: Our evaluation task will use the data in the Dataset to return a hallucination score computed using the Opik hallucination metric.\n",
    "- Scoring metric: We will use the `Equals` metric to check if the hallucination score computed matches the expected output.\n",
    "\n",
    "By defining the evaluation task in this way, we will be able to understand how well Opik's hallucination metric is able to detect hallucinations in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from opik.evaluation.metrics import Hallucination, Equals\n",
    "from opik.evaluation import evaluate\n",
    "from opik import Opik\n",
    "from opik.evaluation.metrics.llm_judges.hallucination.template import generate_query\n",
    "from typing import Dict\n",
    "\n",
    "\n",
    "# Define the evaluation task\n",
    "def evaluation_task(x: Dict):\n",
    "    metric = Hallucination()\n",
    "    try:\n",
    "        metric_score = metric.score(input=x[\"input\"], output=x[\"llm_output\"])\n",
    "        hallucination_score = metric_score.value\n",
    "        hallucination_reason = metric_score.reason\n",
    "    except Exception as e:\n",
    "        print(e)\n",
    "        hallucination_score = None\n",
    "        hallucination_reason = str(e)\n",
    "\n",
    "    return {\n",
    "        \"hallucination_score\": \"yes\" if hallucination_score == 1 else \"no\",\n",
    "        \"hallucination_reason\": hallucination_reason,\n",
    "    }\n",
    "\n",
    "\n",
    "# Get the dataset\n",
    "client = Opik()\n",
    "dataset = client.get_dataset(name=\"HaluEval\")\n",
    "\n",
    "# Define the scoring metric\n",
    "check_hallucinated_metric = Equals(name=\"Correct hallucination score\")\n",
    "\n",
    "# Add the prompt template as an experiment configuration\n",
    "experiment_config = {\n",
    "    \"prompt_template\": generate_query(\n",
    "        input=\"{input}\", context=\"{context}\", output=\"{output}\", few_shot_examples=[]\n",
    "    )\n",
    "}\n",
    "\n",
    "res = evaluate(\n",
    "    dataset=dataset,\n",
    "    task=evaluation_task,\n",
    "    scoring_metrics=[check_hallucinated_metric],\n",
    "    experiment_config=experiment_config,\n",
    "    scoring_key_mapping={\n",
    "        \"reference\": \"expected_hallucination_label\",\n",
    "        \"output\": \"hallucination_score\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the hallucination metric is able to detect ~80% of the hallucinations contained in the dataset and we can see the specific items where hallucinations were not detected.\n",
    "\n",
    "![Hallucination Evaluation](https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/fern/img/cookbook/hallucination_metric_cookbook.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
